Day 28: Implementing Data Analysis

2024 iThome Ironman Contest

Software Development

Part 28 of the series 從無到有，LINE著不走

16th Ironman Contest

ouoquq

Team: NUTC imac

2024-10-06 02:37:25

On Day 28, we focus on basic analysis of the data the LINE Bot has collected. This helps us understand user behavior better and offer more personalized service.

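Step 1: Define the analysis goals

Decide what to analyze:
- Analyze how often users send messages.
- Find the keywords and topics users mention most.
- Count the number of active users and the number of messages.

Data requirements:
- We need the timestamp, user ID, and content of each user message. These are already collected in the user_messages table built on the previous day.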
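
As a concrete reference for the fields listed above, here is a minimal sketch of what the `user_messages` table might look like. The column names `user_id`, `message`, and `timestamp` are the ones the analysis code in this post relies on. The `id` column, the column types, and the sample rows are assumptions for illustration only, since the real schema was defined on the previous day.

```python
import sqlite3

# Assumed schema: only user_id, message and timestamp are named in this post;
# the id column and the column types are illustrative guesses.
SCHEMA = """
CREATE TABLE IF NOT EXISTS user_messages (
    id        INTEGER PRIMARY KEY AUTOINCREMENT,
    user_id   TEXT NOT NULL,
    message   TEXT,
    timestamp TEXT NOT NULL  -- ISO 8601 text, e.g. '2024-10-05 09:15:00'
)
"""

# Made-up sample rows, used only to exercise the table.
SAMPLE_ROWS = [
    ("U001", "hello", "2024-10-05 09:15:00"),
    ("U002", "weather today", "2024-10-05 21:40:00"),
    ("U001", "thanks", "2024-10-06 08:05:00"),
]


def build_sample_db():
    """Create an in-memory database containing the sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    conn.executemany(
        "INSERT INTO user_messages (user_id, message, timestamp) VALUES (?, ?, ?)",
        SAMPLE_ROWS,
    )
    conn.commit()
    return conn


conn = build_sample_db()
total = conn.execute("SELECT COUNT(*) FROM user_messages").fetchone()[0]
print("rows:", total)
conn.close()
```

An in-memory database like this is handy for trying out the analysis steps without touching the real `line_bot.db` file.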

Step 2: Write the data analysis script

**Install pandas**:
- Use the pandas library to process and analyze the data.
```
pip install pandas
```

Read the data and run a basic analysis:

Use SQLite to extract the data and load it into a pandas DataFrame for analysis.

```
import sqlite3
import pandas as pd

# Connect to the database
conn = sqlite3.connect('line_bot.db')

# Read the user_messages table into a DataFrame
df = pd.read_sql_query("SELECT * FROM user_messages", conn)

# Print the first rows of the DataFrame to inspect the data
print(df.head())

conn.close()
```
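
Because the later steps depend on the `user_id`, `message`, and `timestamp` columns, it can help to confirm they exist before analyzing anything. The sketch below is one possible check using SQLite's `PRAGMA table_info`. The helper name `missing_columns` and the demo table are hypothetical and not part of the original project.

```python
import sqlite3

# Columns the analysis steps in this post read from the table.
REQUIRED_COLUMNS = {"user_id", "message", "timestamp"}


def missing_columns(conn, table="user_messages"):
    """Return the required columns that the given table does not have."""
    # PRAGMA table_info yields one row per column; index 1 is the column name.
    # The table name cannot be bound as a parameter, so only pass trusted names.
    rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
    present = {row[1] for row in rows}
    return REQUIRED_COLUMNS - present


# Demo against a throwaway in-memory table that lacks the timestamp column.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE user_messages (user_id TEXT, message TEXT)")
missing = missing_columns(demo)
print("missing columns:", sorted(missing))
demo.close()
```

If the table does not exist at all, `PRAGMA table_info` returns no rows, so every required column is reported as missing.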

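Step 3: Analyze user behavior

Message frequency analysis:

Group the messages by the date part of their timestamps to see how many arrive each day and to identify the periods when users are active.

```
df['timestamp'] = pd.to_datetime(df['timestamp'])
df['date'] = df['timestamp'].dt.date

# Count messages per date
message_frequency = df.groupby('date').size()

print("Messages per day:")
print(message_frequency)
```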
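
The grouping above counts messages per day. To see which hours of the day are busiest, one option is to let SQLite extract the hour directly. This is only a sketch: it assumes `timestamp` is stored as ISO 8601 text such as `2024-10-05 09:15:00`, and the function name `messages_per_hour` and the sample rows are made up for the example.

```python
import sqlite3


def messages_per_hour(conn):
    """Return {hour of day (0-23): message count}, skipping unparseable timestamps."""
    # strftime('%H', ...) extracts the two-digit hour from an ISO 8601 string
    # and returns NULL when the value cannot be parsed as a date.
    query = """
        SELECT CAST(strftime('%H', timestamp) AS INTEGER) AS hour, COUNT(*)
        FROM user_messages
        WHERE strftime('%H', timestamp) IS NOT NULL
        GROUP BY hour
        ORDER BY hour
    """
    return dict(conn.execute(query).fetchall())


# Throwaway in-memory data; the timestamps are invented for the example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user_messages (user_id TEXT, message TEXT, timestamp TEXT)")
conn.executemany(
    "INSERT INTO user_messages VALUES (?, ?, ?)",
    [
        ("U001", "hi", "2024-10-05 09:15:00"),
        ("U002", "hello", "2024-10-05 09:50:00"),
        ("U001", "bye", "2024-10-05 21:40:00"),
        ("U003", "hi", "2024-10-06 09:05:00"),
    ],
)
hourly = messages_per_hour(conn)
conn.close()
print(hourly)
```

With the DataFrame from this step, a comparable result should come from `df.groupby(df['timestamp'].dt.hour).size()` once the column has been converted with `pd.to_datetime`.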

Active user analysis:

Count the messages sent by each user to find the most active users.

```
user_activity = df.groupby('user_id').size().sort_values(ascending=False)

print("User activity (sorted by message count):")
print(user_activity)
```
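
Step 1 also lists the number of active users and the total number of messages as goals. A possible way to get both, together with the top senders, is a pair of SQL aggregates. In this sketch an "active user" is assumed to mean any user with at least one row in the table. The function name `activity_summary` and the sample data are hypothetical.

```python
import sqlite3


def activity_summary(conn, top_n=3):
    """Return total messages, distinct active users, and the top_n senders."""
    total_messages, active_users = conn.execute(
        "SELECT COUNT(*), COUNT(DISTINCT user_id) FROM user_messages"
    ).fetchone()
    # Ties in message count are broken by user_id so the order is stable.
    top_users = conn.execute(
        "SELECT user_id, COUNT(*) AS n FROM user_messages "
        "GROUP BY user_id ORDER BY n DESC, user_id LIMIT ?",
        (top_n,),
    ).fetchall()
    return {
        "total_messages": total_messages,
        "active_users": active_users,
        "top_users": top_users,
    }


# Throwaway in-memory data, invented for the example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user_messages (user_id TEXT, message TEXT, timestamp TEXT)")
conn.executemany(
    "INSERT INTO user_messages VALUES (?, ?, ?)",
    [
        (uid, "msg", "2024-10-05 09:00:00")
        for uid in ["U001", "U002", "U001", "U003", "U003", "U001"]
    ],
)
summary = activity_summary(conn, top_n=2)
conn.close()
print(summary)
```

On the pandas side, `len(df)` and `df['user_id'].nunique()` give the same two totals.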

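Keyword analysis:

Run a word-frequency analysis on the message contents to find the keywords users use most.
Install the nltk library to help with the text analysis.

```
pip install nltk
```

```
import string
from collections import Counter

import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')
# English and Chinese stop words (the 'chinese' list needs a recent copy of the NLTK stopwords corpus)
stop_words = set(stopwords.words('english')) | set(stopwords.words('chinese'))

# Join all messages into one string, skipping rows with no text
all_messages = ' '.join(df['message'].dropna().astype(str))
# Remove ASCII punctuation (string.punctuation does not cover full-width Chinese punctuation)
all_messages = all_messages.translate(str.maketrans('', '', string.punctuation))

# Split on whitespace and count word frequencies, case-insensitively.
# Note: whitespace splitting does not segment Chinese text, which has no spaces between words.
words = all_messages.lower().split()
filtered_words = [word for word in words if word not in stop_words]
word_counts = Counter(filtered_words)

# The 10 most common words
most_common_words = word_counts.most_common(10)
print("Most frequently used words:")
print(most_common_words)
```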
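
Splitting on whitespace works for English but leaves a Chinese sentence as one long token, because Chinese text has no spaces between words. As a rough, dependency-free stand-in for a real word segmenter, the sketch below counts overlapping two-character sequences (character bigrams) for Chinese runs and whole words for English. This is a swapped-in approximation and not the method used above; the `tokenize` function and the sample messages are invented for illustration.

```python
import re
from collections import Counter

# Runs of CJK Unified Ideographs (basic block) or runs of ASCII letters/digits.
# Punctuation and whitespace match neither alternative, so they are dropped.
TOKEN_PATTERN = re.compile(r"[\u4e00-\u9fff]+|[A-Za-z0-9]+")


def tokenize(text):
    """Yield lowercase English words and overlapping Chinese character bigrams."""
    for run in TOKEN_PATTERN.findall(text):
        if run.isascii():
            yield run.lower()
        elif len(run) == 1:
            yield run
        else:
            # e.g. a four-character run yields three overlapping bigrams
            for i in range(len(run) - 1):
                yield run[i:i + 2]


messages = ["今天天氣如何", "明天天氣好嗎", "Hello bot", "hello 天氣"]
counts = Counter(tok for msg in messages for tok in tokenize(msg))
print(counts.most_common(3))
```

Bigrams also produce fragments that are not real words, so the counts are noisier than those from a dedicated Chinese segmentation library such as jieba, which would normally be the better choice for production analysis.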

Step 4: Generate a report

Presenting the analysis:

Write the results to CSV files, or draw charts to show them more intuitively.

```
import matplotlib.pyplot as plt

# Save the analysis results as CSV files
message_frequency.to_csv('message_frequency.csv')
user_activity.to_csv('user_activity.csv')

# Draw a line chart of the daily message counts with Matplotlib
message_frequency.plot(kind='line', title='Daily message count')
plt.xlabel('Date')
plt.ylabel('Message count')
plt.savefig('daily_message_frequency.png')
plt.show()
```